Distributed Data Analytics Systems
Hadoop
Drawbacks of traditional distributed frameworks; why this system?
Drawback:
- Data exchange requires synchronization
- Difficult to cope with partial system failure
Why Hadoop:
- Reliability: handle partial failures
- Scalability: Automatically scales to more computing nodes
- Programmability: written in high-level code
How does HDFS work?
When a client application wants to read a file, it first contacts the NameNode to determine which blocks make up the file and which DataNodes those blocks reside on. It then communicates directly with the DataNodes to read the data.
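A minimal sketch of this read path using Hadoop's Java FileSystem API (the class name and the path taken from args are illustrative; cluster settings come from the environment). The NameNode lookup and the direct DataNode reads both happen inside open() and the subsequent stream reads:

```java
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up fs.defaultFS etc.
        FileSystem fs = FileSystem.get(conf);       // client-side handle to HDFS
        // open() asks the NameNode for the block locations, then the stream
        // reads bytes directly from the DataNodes holding each block.
        try (InputStream in = fs.open(new Path(args[0]))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```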
MapReduce Execution (see the sketch after this list):
- Pre-loaded local input data
- Intermediate data from mappers
- Values exchanged by shuffle process
- Reducing process generates outputs
- Outputs stored in HDFS
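As a concrete instance of these steps, a minimal word-count sketch against the standard Hadoop MapReduce Java API (the class names are the usual textbook ones, not from these notes):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: consumes pre-loaded local input splits, emits intermediate (word, 1) pairs.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer it = new StringTokenizer(value.toString());
        while (it.hasMoreTokens()) {
            word.set(it.nextToken());
            context.write(word, ONE);   // intermediate data, later exchanged by the shuffle
        }
    }
}

// Reducer: receives all values for a key after the shuffle, writes output to HDFS.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        context.write(key, new IntWritable(sum));
    }
}
```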
MapReduce bottlenecks:
Problem: the huge volume of intermediate data makes the shuffle step very time-consuming.
Solution: Hadoop starts transferring data from mappers to reducers as soon as individual mappers finish their work.
Problem: stragglers. No reducer can start reducing until every mapper has finished, so a single slow mapper delays the whole job.
Solution: Hadoop uses speculative execution: if a mapper appears to be running significantly more slowly than the others, a new instance of that mapper is started on another machine over the same data; whichever copy finishes first is used and the other is killed.
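Speculative execution is on by default; a hedged sketch of toggling it per phase through job configuration (the property names below are the Hadoop 2.x ones):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Speculative execution can be toggled per phase:
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", false);
        Job job = Job.getInstance(conf, "wordcount");
        // ... set mapper/reducer/paths as usual, then job.waitForCompletion(true)
    }
}
```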
Problem: data must be shuffled to the reducers, which generates a lot of network traffic.
Solution: a Combiner, a “mini-reduce” that runs locally on a single mapper’s output; its code is often identical to the reducer’s.
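A driver sketch that reuses IntSumReducer from the word-count example above as the combiner; this is safe here because summing is associative and commutative (input/output paths are taken from the command line):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        // The combiner runs on each mapper's local output before the shuffle,
        // shrinking the data that crosses the network.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```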
Problem: the default partitioner can cause performance issues (e.g. key skew), or a secondary sort is needed.
Solution: write your own custom Partitioner.
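A minimal custom Partitioner sketch (the route-by-first-letter scheme is purely illustrative):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes keys by their first letter instead of hashCode(), e.g. to control
// skew or to co-locate related keys for a secondary sort.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String s = key.toString();
        char first = s.isEmpty() ? '\0' : Character.toLowerCase(s.charAt(0));
        return (first & Integer.MAX_VALUE) % numPartitions;
    }
}
```

It is registered with job.setPartitionerClass(FirstLetterPartitioner.class); for a secondary sort, a custom partitioner is typically paired with a grouping comparator via job.setGroupingComparatorClass.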
HaLoop
Drawbacks of traditional distributed frameworks; why this system?
Drawbacks:
Hard to handle recursive/iterative programs, e.g. graph analytics, machine learning, data mining, and recursive queries: MapReduce must reload and reshuffle the data on every iteration.
Why HaLoop:
TaskTracker (Cache management)
Scheduler (Cache awareness)
Programming model (multi-step loop bodies, cache control)
It is an efficient common runtime for recursive languages: Map, Reduce, Fixpoint.
Solution:
Inter-iteration caching (see the job sketch after the cache descriptions):
- Mapper input cache (MI)
- Reducer input cache (RI)
- Reducer output cache (RO)
RI - Reducer Input Cache:
Access to loop-invariant data without map/shuffle; used by the reducer function.
RO - Reducer Output Cache:
Distributed access to the output of the previous iteration; used by fixpoint evaluation.
MI - Mapper Input Cache:
Access to non-local mapper input on later iterations; used during scheduling of map tasks.
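A hedged sketch of how an iterative job might declare these caches, using the extended-API method names listed in the HaLoop paper (a PageRank-style loop; the map/reduce classes and the exact signatures are hypothetical, since HaLoop is a research prototype rather than a released library):

```java
// Sketch only: method names follow the HaLoop paper's extended API;
// RankJoinMap/RankAggregateReduce and the path are illustrative.
Job job = new Job();
job.AddMap(new RankJoinMap(), 1);                     // step 1 of the loop body
job.AddReduce(new RankAggregateReduce(), 1);
job.AddInvariantTable("hdfs://cluster/pagerank/linkage"); // loop-invariant input
job.SetReducerInputCache(true);   // RI: skip re-mapping/re-shuffling invariant data
job.SetReducerOutputCache(true);  // RO: lets the fixpoint check compare iterations
job.SetFixedPointThreshold(0.001f);  // stop when results change less than this
job.SetMaxNumOfIterations(30);
job.Submit();
```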
Architecture:
- Loop Control
- Caching
- Indexing
FlumeJava
Drawbacks of traditional distributed frameworks; why this system?
Long, complicated data-parallel pipelines are difficult to program and manage: each MapReduce job must materialize its intermediate results, and there is high overhead at the synchronization barrier between successive MapReduce jobs.
Why FlumeJava?
Expressiveness
Abstractions
Performance (lazy evaluation and dynamic optimization)
Usability & deployability (implemented as a Java library)
Optimizations (see the pipeline sketch after this list):
- Sink flattens
- ParallelDo fusion
- MSCR fusion
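A word-count pipeline sketch in the style of the FlumeJava paper's examples (FlumeJava was never released publicly, so names like readTextFileCollection and splitIntoWords follow the paper's snippets rather than a public API). The key behavior is lazy evaluation: the calls below only build a deferred execution plan, which the optimizer fuses (ParallelDo fusion, MSCR fusion, sinking flattens) when FlumeJava.run() is invoked:

```java
// Deferred word count; nothing executes until FlumeJava.run().
PCollection<String> lines = readTextFileCollection("/gfs/data/shakes/hamlet.txt");
PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
    void process(String line, EmitFn<String> emitFn) {
        for (String word : splitIntoWords(line)) {  // hypothetical helper
            emitFn.emit(word);
        }
    }
}, collectionOf(strings()));
// count() is itself derived from parallelDo + groupByKey + combineValues.
PTable<String, Integer> wordCounts = words.count();
writeToRecordFileTable(wordCounts, "/gfs/data/shakes/hamlet-counts.records");
// run() optimizes the deferred graph into a small number of MapReduce
// jobs, then executes them.
FlumeJava.run();
```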
Dryad
Drawbacks of traditional distributed frameworks; why this system?
Dryad is a general-purpose execution engine for coarse-grained data-parallel applications: simple programs are easy to write, while the execution engine automatically manages scheduling, distribution, fault tolerance, etc.
Why Dryad?
Job = Directed Acyclic Graph
Computational “vertices” connected by communication “channels” (edges)
What is GDL (Graph Description Language)?
A lower-level programming model than SQL: the job graph is composed programmatically.
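Dryad's actual GDL is a C++ library with graph-composition operators (e.g. cloning a vertex stage n times and connecting stages pointwise or all-to-all). As a language-neutral illustration of the core idea, a job as vertices plus channels, here is a toy Java model; all names are illustrative, not Dryad's API:

```java
import java.util.*;

// Toy model of a Dryad-style job: computational vertices connected by
// directed channels (files, TCP pipes, or shared-memory FIFOs in Dryad).
public class JobGraph {
    final Set<String> vertices = new LinkedHashSet<>();
    final Map<String, List<String>> channels = new HashMap<>();

    void vertex(String name) { vertices.add(name); }
    void channel(String from, String to) {
        channels.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    public static void main(String[] args) {
        JobGraph g = new JobGraph();
        // A two-stage read/aggregate shape: two readers feed one aggregator.
        g.vertex("R1"); g.vertex("R2"); g.vertex("A");
        g.channel("R1", "A");
        g.channel("R2", "A");
        System.out.println(g.channels);   // {R1=[A], R2=[A]}
    }
}
```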
Architecture?
- Job Manager
- Name Server
- Daemons
Spark
Drawbacks of traditional distributed frameworks; why this system?
- Complex, multi-pass applications
- Interactive ad-hoc queries
- Reuse of intermediate results across multiple computations
(MapReduce supports data sharing between computations only through stable storage, which is slow.)
RDD (Resilient Distributed Datasets)?
- A restricted form of distributed shared memory: RDDs can only be built through coarse-grained deterministic transformations
- Fault recovery using lineage: log the transformations used to build a dataset (enough information about how it was derived from other RDDs), rather than the data itself
RDDs are good for (as the sketch below illustrates):
Applying the same operation to all elements of a dataset (coarse-grained operations)
Remembering each transformation as one step in a lineage graph
Recovering lost partitions without having to log large amounts of data
Not good for: asynchronous fine-grained updates to shared state
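The sketch below mirrors the log-mining example from the RDD paper, using Spark's Java API (the input path is hypothetical). Each transformation adds one step to the lineage graph, cache() marks the RDD for in-memory reuse, and a lost cached partition is rebuilt by replaying just its own lineage:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddLineage {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-lineage").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Each coarse-grained transformation is one step in the lineage
            // graph; only the transformations are logged, not the data.
            JavaRDD<String> lines  = sc.textFile("hdfs://namenode/logs/app.log"); // hypothetical path
            JavaRDD<String> errors = lines.filter(l -> l.startsWith("ERROR"));
            errors.cache();   // keep in memory for reuse across computations
            long total = errors.count();                                   // triggers execution
            long httpErrors = errors.filter(l -> l.contains("HTTP")).count(); // reuses the cache
            // If a cached partition is lost, Spark recomputes just that
            // partition by replaying filter() on the corresponding input split.
            System.out.println(total + " " + httpErrors);
        }
    }
}
```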
Task Scheduler:
Dryad-like DAGs
Naiad
Drawbacks of traditional distributed frameworks; why this system?
Needed: iterative processing on streaming data, and interactive queries on a fresh, consistent view of the results.
Why Naiad?
A new computational model: timely dataflow
Solution:
Iterative and incremental computations: structured loops allow feedback in the dataflow, and stateful dataflow vertices can consume and produce records without global coordination.
Producing consistent results: vertices receive notifications once they have received all records for a given round of input or loop iteration.
Key point:
Timestamps:
In the dataflow graph, every stateful vertex receives timestamped messages along directed edges.
In nested cycles, timestamps distinguish data from different input epochs and loop iterations (see the sketch below).
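A small Java sketch of such a timestamp, following the structure in the Naiad paper: an input epoch plus one counter per enclosing loop, where ingress, feedback, and egress vertices respectively push, increment, and pop the innermost counter (class and method names are illustrative; Naiad itself is written in C#):

```java
import java.util.Arrays;

// Timely-dataflow-style timestamp: input epoch + one counter per loop.
public final class Timestamp {
    final int epoch;
    final int[] loopCounters;   // outermost loop first

    Timestamp(int epoch, int... loopCounters) {
        this.epoch = epoch;
        this.loopCounters = loopCounters;
    }

    // Ingress vertex: entering a loop appends a counter starting at 0.
    Timestamp enterLoop() {
        return new Timestamp(epoch, Arrays.copyOf(loopCounters, loopCounters.length + 1));
    }

    // Feedback vertex: each pass around the loop bumps the innermost counter.
    Timestamp nextIteration() {
        int[] c = loopCounters.clone();
        c[c.length - 1]++;
        return new Timestamp(epoch, c);
    }

    // Egress vertex: leaving a loop drops the innermost counter.
    Timestamp leaveLoop() {
        return new Timestamp(epoch, Arrays.copyOf(loopCounters, loopCounters.length - 1));
    }

    // Order used for progress tracking (same nesting depth assumed):
    // t1 <= t2 iff the epochs and the loop counters are ordered pointwise.
    boolean lessOrEqual(Timestamp other) {
        if (epoch > other.epoch) return false;
        for (int i = 0; i < loopCounters.length; i++)
            if (loopCounters[i] > other.loopCounters[i]) return false;
        return true;
    }

    public static void main(String[] args) {
        Timestamp first = new Timestamp(0).enterLoop();   // epoch 0, iteration 0
        Timestamp second = first.nextIteration();         // epoch 0, iteration 1
        // Same epoch, different iterations: the counters keep them distinct.
        System.out.println(first.lessOrEqual(second));    // true
        System.out.println(second.lessOrEqual(first));    // false
    }
}
```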
Two modes:
Naiad supports both asynchronous and fine-grained synchronous execution.
- Batching: synchronous; one-to-one correspondence between input and output
- Streaming: asynchronous; overlapping computation (low latency)
Why low latency?
- Programming model: asynchronous and fine-grained synchronous execution.
- Distributed progress-tracking protocol: enables processes to deliver notifications promptly.
Husky
Drawbacks of traditional distributed frameworks; why this system?
Existing frameworks rarely offer all of: high performance, a flat learning curve, good reusability, low maintenance cost, and high compatibility.
Why Husky?
A new computational model that makes Husky general and expressive
Architecture?
Master-Worker architecture
Master:
- keeps worker information and the data partitioning scheme
- does not sit on the data path and does not compute
- coordinates work among workers and monitors their progress
Worker:
- reads/writes data, communicates with other workers, and computes in parallel
- sends a heartbeat to the master periodically
Implementation:
Channel-based messaging subsystem -> makes streaming computation possible.
Attribute lists are stored as in a column store (sketched below): better locality and more opportunity to optimize (vectorization), and attributes can be added without recompiling, which is useful for interactive data analysis.
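A toy Java sketch of the column-store idea (Husky itself is C++; these names are illustrative, not its API): each attribute is a separate contiguous list, so scanning one attribute touches only that column, and a new attribute is just a new list, which is why attributes can be added without recompiling. A real implementation would use typed primitive arrays for locality and vectorization rather than boxed Objects:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Column-oriented attribute storage: one list per attribute, not one
// record per object.
public class AttributeStore {
    private final Map<String, List<Object>> columns = new HashMap<>();
    private int size = 0;

    // Adding an attribute just adds a list; existing objects are untouched.
    void addAttribute(String name) {
        List<Object> col = new ArrayList<>();
        for (int i = 0; i < size; i++) col.add(null);
        columns.put(name, col);
    }

    int addObject() {
        for (List<Object> col : columns.values()) col.add(null);
        return size++;
    }

    void set(String attr, int id, Object v) { columns.get(attr).set(id, v); }
    Object get(String attr, int id)         { return columns.get(attr).get(id); }

    public static void main(String[] args) {
        AttributeStore s = new AttributeStore();
        s.addAttribute("pagerank");
        int v = s.addObject();
        s.set("pagerank", v, 0.15);
        s.addAttribute("degree");   // added later, no recompilation needed
        s.set("degree", v, 3);
        System.out.println(s.get("pagerank", v) + " " + s.get("degree", v));
    }
}
```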